random subspace
Towards a Systematic Approach to Design New Ensemble Learning Algorithms
Mendes-Moreira, João, Mendes-Neves, Tiago
Ensemble learning has been a focal point of machine learning research due to its potential to improve predictive performance. This study revisits the foundational work on ensemble error decomposition, historically confined to bias-variance-covariance analysis for regression problems since the 1990s. Recent advancements introduced a "unified theory of diversity," which proposes an innovative bias-variance-diversity decomposition framework. Leveraging this contemporary understanding, our research systematically explores the application of this decomposition to guide the creation of new ensemble learning algorithms. Focusing on regression tasks, we employ neural networks as base learners to investigate the practical implications of this theoretical framework. This approach used 7 simple ensemble methods, we name them strategies, for neural networks that were used to generate 21 new ensemble algorithms. Among these, most of the methods aggregated with the snapshot strategy, one of the 7 strategies used, showcase superior predictive performance across diverse datasets w.r.t. the Friedman rank test with the Conover post-hoc test. Our systematic design approach contributes a suite of effective new algorithms and establishes a structured pathway for future ensemble learning algorithm development.
Heterogeneous federated collaborative filtering using FAIR: Federated Averaging in Random Subspaces
Desai, Aditya, Meisburger, Benjamin, Liu, Zichang, Shrivastava, Anshumali
Recommendation systems (RS) for items (e.g., movies, books) and ads are widely used to tailor content to users on various internet platforms. Traditionally, recommendation models are trained on a central server. However, due to rising concerns for data privacy and regulations like the GDPR, federated learning is an increasingly popular paradigm in which data never leaves the client device. Applying federated learning to recommendation models is non-trivial due to large embedding tables, which often exceed the memory constraints of most user devices. To include data from all devices in federated learning, we must enable collective training of embedding tables on devices with heterogeneous memory capacities. Current solutions to heterogeneous federated learning can only accommodate a small range of capacities and thus limit the number of devices that can participate in training. We present Federated Averaging in Random subspaces (FAIR), which allows arbitrary compression of embedding tables based on device capacity and ensures the participation of all devices in training. FAIR uses what we call consistent and collapsible subspaces defined by hashing-based random projections to jointly train large embedding tables while using varying amounts of compression on user devices. We evaluate FAIR on Neural Collaborative Filtering tasks with multiple datasets and verify that FAIR can gather and share information from a wide range of devices with varying capacities, allowing for seamless collaboration. We prove the convergence of FAIR in the homogeneous setting with non-i.i.d data distribution. Our code is open source at {https://github.com/apd10/FLCF}
Fine-tuning Happens in Tiny Subspaces: Exploring Intrinsic Task-specific Subspaces of Pre-trained Language Models
Zhang, Zhong, Liu, Bang, Shao, Junming
Pre-trained language models (PLMs) are known to be overly parameterized and have significant redundancy, indicating a small degree of freedom of the PLMs. Motivated by the observation, in this paper, we study the problem of re-parameterizing and fine-tuning PLMs from a new perspective: Discovery of intrinsic task-specific subspace. Specifically, by exploiting the dynamics of the fine-tuning process for a given task, the parameter optimization trajectory is learned to uncover its intrinsic task-specific subspace. A key finding is that PLMs can be effectively fine-tuned in the subspace with a small number of free parameters. Beyond, we observe some outlier dimensions emerging during fine-tuning in the subspace. Disabling these dimensions degrades the model performance significantly. This suggests that these dimensions are crucial to induce task-specific knowledge to downstream tasks.
Regularized ERM on random subspaces
Della Vecchia, Andrea, De Vito, Ernesto, Rosasco, Lorenzo
We study a natural extension of classical empirical risk minimization, where the hypothesis space is a random subspace of a given space. In particular, we consider possibly data dependent subspaces spanned by a random subset of the data, recovering as a special case Nystrom approaches for kernel methods. Considering random subspaces naturally leads to computational savings, but the question is whether the corresponding learning accuracy is degraded. These statistical-computational tradeoffs have been recently explored for the least squares loss and self-concordant loss functions, such as the logistic loss. Here, we work to extend these results to convex Lipschitz loss functions, that might not be smooth, such as the hinge loss used in support vector machines. This unified analysis requires developing new proofs, that use different technical tools, such as sub-gaussian inputs, to achieve fast rates. Our main results show the existence of different settings, depending on how hard the learning problem is, for which computational efficiency can be improved with no loss in performance.
How many degrees of freedom do we need to train deep networks: a loss landscape perspective
Larsen, Brett W., Fort, Stanislav, Becker, Nic, Ganguli, Surya
A variety of recent works, spanning pruning, lottery tickets, and training within random subspaces, have shown that deep neural networks can be trained using far fewer degrees of freedom than the total number of parameters. We explain this phenomenon by first examining the success probability of hitting a training loss sub-level set when training within a random subspace of a given training dimensionality. We find a sharp phase transition in the success probability from $0$ to $1$ as the training dimension surpasses a threshold. This threshold training dimension increases as the desired final loss decreases, but decreases as the initial loss decreases. We then theoretically explain the origin of this phase transition, and its dependence on initialization and final desired loss, in terms of precise properties of the high dimensional geometry of the loss landscape. In particular, we show via Gordon's escape theorem, that the training dimension plus the Gaussian width of the desired loss sub-level set, projected onto a unit sphere surrounding the initialization, must exceed the total number of parameters for the success probability to be large. In several architectures and datasets, we measure the threshold training dimension as a function of initialization and demonstrate that it is a small fraction of the total number of parameters, thereby implying, by our theory, that successful training with so few dimensions is possible precisely because the Gaussian width of low loss sub-level sets is very large. Moreover, this threshold training dimension provides a strong null model for assessing the efficacy of more sophisticated ways to reduce training degrees of freedom, including lottery tickets as well a more optimal method we introduce: lottery subspaces.
Subspace Inference for Bayesian Deep Learning
Izmailov, Pavel, Maddox, Wesley J., Kirichenko, Polina, Garipov, Timur, Vetrov, Dmitry, Wilson, Andrew Gordon
Bayesian inference was once a gold standard for learning with neural networks, providing accurate full predictive distributions and well calibrated uncertainty. However, scaling Bayesian inference techniques to deep neural networks is challenging due to the high dimensionality of the parameter space. In this paper, we construct low-dimensional subspaces of parameter space, such as the first principal components of the stochastic gradient descent (SGD) trajectory, which contain diverse sets of high performing models. In these subspaces, we are able to apply elliptical slice sampling and variational inference, which struggle in the full parameter space. We show that Bayesian model averaging over the induced posterior in these subspaces produces accurate predictions and well calibrated predictive uncertainty for both regression and image classification.
Linear Dimensionality Reduction in Linear Time: Johnson-Lindenstrauss-type Guarantees for Random Subspace
Randomized dimensionality reduction techniques, such as random projection (RP) [7, 15] and Ho's random subspace method (RS) [12] are popular approaches for data compression, with many empirical studies showing the utility of both for machine learning and data mining tasks in practice [26, 11, 21, 19, 18, 27]. For RP a key theoretical motivation behind their use is the Johnson-Lindenstrauss lemma (JLL), the usual constructive proof of which also implies an algorithm with high-probability geometry preservation guarantees for projected data. However RP is costly to apply to large or high-dimensional datasets since it requires a matrix-matrix multiplication to implement the projection, and furthermore the projected features may be hard to interpret. On the other hand RS is a particularly appealing approach for dimensionality reduction because it involves simply selecting a subset of data feature indices randomly without replacement, and so does not require a matrix-matrix multiplication to implement the projection and it retains (a subset of) the original features. RS is therefore computationally far more efficient in practice, and more interpretable than RP, but there is little theory to explain its effectiveness. Focusing on this latter problem, here we prove data-dependent norm-preservation guarantees for data projected onto a random subset of the data features.
Sequential Low-Rank Change Detection
Detecting emergence of a low-rank signal from high-dimensional data is an important problem arising from many applications such as camera surveillance and swarm monitoring using sensors. We consider a procedure based on the largest eigenvalue of the sample covariance matrix over a sliding window to detect the change. To achieve dimensionality reduction, we present a sketching-based approach for rank change detection using the low-dimensional linear sketches of the original high-dimensional observations. The premise is that when the sketching matrix is a random Gaussian matrix, and the dimension of the sketching vector is sufficiently large, the rank of sample covariance matrix for these sketches equals the rank of the original sample covariance matrix with high probability. Hence, we may be able to detect the low-rank change using sample covariance matrices of the sketches without having to recover the original covariance matrix. We character the performance of the largest eigenvalue statistic in terms of the false-alarm-rate and the expected detection delay, and present an efficient online implementation via subspace tracking.
Prediction Error Reduction Function as a Variable Importance Score
This paper introduces and develops a novel variable importance score function in the context of ensemble learning and demonstrates its appeal both theoretically and empirically. Our proposed score function is simple and more straightforward than its counterpart proposed in the context of random forest, and by avoiding permutations, it is by design computationally more efficient than the random forest variable importance function. Just like the random forest variable importance function, our score handles both regression and classification seamlessly. One of the distinct advantage of our proposed score is the fact that it offers a natural cut off at zero, with all the positive scores indicating importance and significance, while the negative scores are deemed indications of insignificance. An extra advantage of our proposed score lies in the fact it works very well beyond ensemble of trees and can seamlessly be used with any base learners in the random subspace learning context. Our examples, both simulated and real, demonstrate that our proposed score does compete mostly favorably with the random forest score.
Spectrum of Variable-Random Trees
Liu, F. T., Ting, K. M., Yu, Y., Zhou, Z. H.
In this paper, we show that a continuous spectrum of randomisation exists, in which most existing tree randomisations are only operating around the two ends of the spectrum. That leaves a huge part of the spectrum largely unexplored. We propose a base learner VR-Tree which generates trees with variable-randomness. VR-Trees are able to span from the conventional deterministic trees to the complete-random trees using a probabilistic parameter. Using VR-Trees as the base models, we explore the entire spectrum of randomised ensembles, together with Bagging and Random Subspace. We discover that the two halves of the spectrum have their distinct characteristics; and the understanding of which allows us to propose a new approach in building better decision tree ensembles. We name this approach Coalescence, which coalesces a number of points in the random-half of the spectrum. Coalescence acts as a committee of ``experts'' to cater for unforeseeable conditions presented in training data. Coalescence is found to perform better than any single operating point in the spectrum, without the need to tune to a specific level of randomness. In our empirical study, Coalescence ranks top among the benchmarking ensemble methods including Random Forests, Random Subspace and C5 Boosting; and only Coalescence is significantly better than Bagging and Max-Diverse Ensemble among all the methods in the comparison. Although Coalescence is not significantly better than Random Forests, we have identified conditions under which one will perform better than the other.